Abstract

This report briefly discusses our preliminary performance experiments
with two parallel versions of the OVERFLOW-D application, one based on
the MPI paradigm and one on the hybrid (MPI+OpenMP) paradigm, on the
IBM Power4 system at the NAS Division. This work is part of an effort to
determine the suitability of the system and its parallel libraries
(MPI/OpenMP) for specific scientific computing objectives.


1 Introduction 

The IBM Power4 system at the NAS Division is composed of two 32-way
symmetric multiprocessors (SMPs), each with 32 GB of central memory. Each
SMP is contained within a single cabinet. This system is temporarily installed
at NAS for a preliminary assessment of its suitability for high-performance
scientific computing. Here we describe the performance of NASA’s
overset grid CFD application, OVERFLOW-D [2], on the IBM Power4 test
bed. OVERFLOW-D has been specialized for moving-body (dynamic) grid
applications, and is based on a version of the NASA aerodynamic flow solver
OVERFLOW [3], which was developed mainly for static overset grid systems.
Two parallel versions of OVERFLOW-D have been considered for our
experiments. One is based on the MPI paradigm, referred to here as
“overd-mpi” [4], and the second is based on the hybrid (MPI+OpenMP)
paradigm, referred to here as “overd-hybrid” [1]. Both versions have already
been tested on an SGI O2K platform.

2 Performance Results 

The test case used with the above applications has a grid system of about
8 million grid points, with 41 curvilinear grid blocks covering the flow domain.
Several runs have been made with both the overd-mpi and overd-hybrid
applications on the IBM P4 system using this same test case. The performance
data below were measured over a period of 20 time-steps. At the time of our
experiments, one cabinet of the test bed, named ibm02, retained its original
configuration (i.e., one 32-way node), but the second cabinet, named ibm01,
was reconfigured as four 8-way nodes. All four nodes of ibm01, named
ibm1-0{1,2,3,4}sa, were coupled with “Colony switches”. Ibm02 and ibm01
were coupled over gigabit Ethernet.

Our performance experiments on the IBM test bed were conducted
interactively, so the timings fluctuate somewhat depending on the concurrent
usage of the system by other users. To minimize these effects, we frequently
monitored the system occupancy with the program “topas”, and in some cases
had to repeat the runs several times. The best data are reported here.
Nevertheless, because of the time constraints during these runs, a small
margin of error should be kept in mind.


2.1 MPI Application 

The overd-mpi application was used for the experiments here. The code was 
compiled in 64-bit mode with the following options:

• F77 = mpxlf_r

• CC = xlc_r

• LINK = ld

• FFLAGS = -O3 -g -q64 -qhot -qnosave -qtune=pwr4 -qcache=auto

• CFLAGS = -O -g -q64

Table 1 shows performance results of overd-mpi on the IBM P4. The execution
runtime (in seconds per time-step), denoted by T_exec, consists of computation
and communication timings and is averaged over the total number of MPI
processes used, N_MPI. Runtimes are reported on ibm02 and ibm01. Some runs
are split across the two systems, indicated by ibm02+ibm01, and some are split
across the nodes of ibm01. All the split jobs are characterized by the number of
nodes used, N_nodes; the MPI processes are split equally between the nodes.
The total number of processors, N_procs, is equal to N_MPI. Whenever possible,
the runtimes on the IBM P4 are compared with similar runs on the tightly coupled
SGI O2K machine. Results for the O2K runs are taken from Table 2 in reference [ ].
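
The equal split of MPI processes between nodes can be sketched as follows. This is only an illustration: the function name `split_ranks` is hypothetical, and the report does not describe how the job launcher actually placed processes on nodes.

```python
def split_ranks(n_mpi, n_nodes):
    """Return the number of MPI processes placed on each node.

    Assumes an equal split, as stated in the text; raises an error
    if the processes cannot be divided evenly between the nodes.
    """
    if n_mpi < 1 or n_nodes < 1:
        raise ValueError("n_mpi and n_nodes must be positive")
    if n_mpi % n_nodes != 0:
        raise ValueError("n_mpi is not evenly divisible by n_nodes")
    return [n_mpi // n_nodes] * n_nodes


# 16 MPI processes on the four 8-way nodes of ibm01
print(split_ranks(16, 4))
# 32 MPI processes split across ibm02 and ibm01
print(split_ranks(32, 2))
```

Each list entry is the process count on one node, so the entries always sum to N_MPI.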

As seen in Table 1, for the same number of processors, runtimes on the
single node of ibm02 are shorter (i.e., more efficient) than similar runs on the
multiple nodes of ibm01 or on ibm02+ibm01 combined. The difference in
performance is mainly a reflection of the latency and communication time
across processors. The inter-processor communication time on ibm02 is shorter
due to the stronger interconnection, as compared with the communication time
through the “Colony switches” used in ibm01 or through the gigabit Ethernet
between ibm02 and ibm01. Runs on ibm02 are 2.5 to 3.5 times faster than the
runs on the O2K for 2 to 16 processors, but only about 1.5 times faster with
32 processors. It should be noted that the latter runs suffer significantly from
poor load balancing caused by assigning 41 grid blocks onto 32 MPI processes.
The parallel scalability on the O2K is slightly better than on the IBM P4.


Table 1: Comparison of OVERFLOW-D runtimes (in seconds) on IBM P4 and
SGI O2K based on the MPI programming model, using the 8 million grid point
test case

MPI Application

                  IBM P4                            SGI O2K
N_procs   N_MPI   Machine       N_nodes   T_exec    T_exec
2         2       ibm02         1         15.0      - (a)
                  ibm01         2         15.8      -
                  ibm02+ibm01   2         16.3      -
4         4       ibm02         1         8.5       31.4
                  ibm01         4         9.3       -
                  ibm02+ibm01   2         10.0      -
8         8       ibm02         1         4.3       15.4
                  ibm01         4         4.8       -
                  ibm02+ibm01   2         6.1       -
16        16      ibm02         1         3.7       9.0
                  ibm01         4         5.5       -
                  ibm02+ibm01   2         4.2       -
32        32      ibm02         1         3.4       5.3
                  ibm01         4         3.8       -
                  ibm02+ibm01   2         4.5       -

(a) Dash denotes “Not Applicable”, or data was not available.
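
The comparisons drawn from Table 1 can be checked with a few lines of arithmetic. The sketch below uses only the ibm02 and O2K runtimes from that table. The function names are hypothetical, and `balance_bound` rests on the assumption that all 41 grid blocks are the same size, which the report does not state.

```python
import math

# T_exec in seconds per time-step, copied from Table 1
# (ibm02 is the single 32-way IBM P4 node).
T_IBM02 = {2: 15.0, 4: 8.5, 8: 4.3, 16: 3.7, 32: 3.4}
T_O2K = {4: 31.4, 8: 15.4, 16: 9.0, 32: 5.3}


def machine_ratio(n_procs):
    """How many times faster ibm02 is than the O2K at n_procs."""
    return T_O2K[n_procs] / T_IBM02[n_procs]


def relative_speedup(times, n_ref, n_procs):
    """Speedup of the run on n_procs relative to the run on n_ref."""
    return times[n_ref] / times[n_procs]


def balance_bound(n_blocks, n_procs):
    """Upper bound on load-balance efficiency when whole blocks are
    assigned to processes, ASSUMING all blocks have equal size."""
    busiest = math.ceil(n_blocks / n_procs)  # blocks on the busiest process
    return (n_blocks / n_procs) / busiest


for n in sorted(T_O2K):
    print(f"{n:2d} procs: ibm02 is {machine_ratio(n):.1f}x faster than O2K")

print(f"4 -> 32 procs speedup, ibm02: {relative_speedup(T_IBM02, 4, 32):.1f}")
print(f"4 -> 32 procs speedup, O2K:   {relative_speedup(T_O2K, 4, 32):.1f}")
print(f"41 blocks on 32 processes, bound: {balance_bound(41, 32):.2f}")
```

With the Table 1 values, the 4-to-32-processor speedup comes out larger on the O2K than on ibm02, in line with the scalability remark. The equal-size bound is a simplification, not a statement about the actual block sizes in the test case.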


2.2 Hybrid Application 

The overd-hybrid application was used for the following experiments. This code 
was similarly compiled in a 64-bit mode using the following compiler options: 

• F77 = mpxlf_r 

• CC = xlc_r 

• LINK = mpxlf_r -q64

• FFLAGS = -O3 -g -q64 -qsmp=omp -qfixed -qnosave

• CFLAGS = -O -g -q64

• LINKFLAGS = -qsmp 

It should be noted that compilation of the code in 64-bit mode with
the compiler optimization option “qhot” together with the OpenMP option
“qsmp=omp” failed on one of our subroutines, while it compiled successfully
when “qhot” was turned off. Furthermore, it was found that compiling
two other subroutines with “qsmp” resulted in unstable solutions at runtime,
while without “qsmp” the solutions were stable. For all runs, the code was
therefore compiled with the options listed above, i.e., without “qhot” but
with “qsmp”, except for those two subroutines, which were compiled
without “qsmp”.


Table 2: Part 1; Comparison of OVERFLOW-D runtimes (in seconds) on IBM
P4 and SGI O2K based on the hybrid programming model, using the 8 million
grid point test case, with a total of 2 to 16 processors

Hybrid Application

                                 IBM P4                            SGI O2K
Total N_procs   N_MPI   N_thrd   Machine       N_nodes   T_exec    T_exec
2               2       1        ibm02         1         18.2      -
                                 ibm01         2         18.7      -
                                 ibm02+ibm01   2         19.0      -
4               4       1        ibm02         1         10.1      24.6
                                 ibm01         4         10.6      -
                                 ibm02+ibm01   2         11.5      -
                2       2        ibm02         1         10.5      -
                                 ibm01         2         10.1      -
                                 ibm02+ibm01   1         10.6      -
8               8       1        ibm02         1         6.0       14.2
                                 ibm01         4         5.8       -
                                 ibm02+ibm01   2         6.7       -
                4       2        ibm02         1         6.2       17.2
                                 ibm01         4         6.4       -
                                 ibm02+ibm01   2         6.8       -
                2       4        ibm02         1         5.9       -
                                 ibm01         2         6.1       -
                                 ibm02+ibm01   2         6.0       -
16              16      1        ibm02         1         4.5       9.6
                                 ibm01         4         5.6       -
                                 ibm02+ibm01   2         4.7       -
                8       2        ibm02         1         3.9       10.5
                                 ibm01         4         3.6       -
                                 ibm02+ibm01   2         4.2       -
                4       4        ibm02         1         3.7       12.8
                                 ibm01         4         3.8       -
                                 ibm02+ibm01   2         4.2       -
                2       8        ibm02         1         4.0       14.6
                                 ibm02+ibm01   2         4.2       -


The performance results of the overd-hybrid application on the IBM P4 are
presented in two parts, in Tables 2 and 3. The former shows results for
N_procs = 2, 4, 8, and 16, and the latter for N_procs = 32 and 64. These tables
contain data similar to Table 1, with an additional column, N_thrd, that lists
the number of OpenMP threads used per MPI process. The following relation
holds: N_procs = N_MPI * N_thrd.
For a given value of N_procs, several runs with different combinations of
N_MPI * N_thrd have been reported; each reflects a different distribution of,
and access to, data in memory. Again, as with the MPI results in §2.1, for
multiple nodes (i.e., N_nodes > 1) the MPI processes are split equally between
the nodes. In comparison, runtimes on ibm02 for N_procs up to 16 are two to
three times faster than the similar runs on the O2K, but not much faster for
N_procs > 32. Similarly, runs on multiple nodes are slower than the
corresponding runs on a single node, and their runtimes are 7 to 20% longer.
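
The relation N_procs = N_MPI * N_thrd determines which process and thread combinations are possible for a given processor count. A minimal sketch, assuming (as in Tables 2 and 3) that at least two MPI processes are always used; the function name and the `min_mpi` parameter are hypothetical:

```python
def decompositions(n_procs, min_mpi=2):
    """List all (n_mpi, n_thrd) pairs with n_mpi * n_thrd == n_procs.

    min_mpi=2 is an assumption taken from the tables, where the
    smallest number of MPI processes reported is 2.
    """
    pairs = []
    for n_mpi in range(n_procs, min_mpi - 1, -1):
        if n_procs % n_mpi == 0:
            pairs.append((n_mpi, n_procs // n_mpi))
    return pairs


for n_procs in (2, 4, 8, 16, 32):
    print(n_procs, decompositions(n_procs))
```

For 16 processors this yields the four combinations reported in Table 2, from 16 single-threaded MPI processes down to 2 MPI processes with 8 threads each.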

In Table 3, timing data that were unexpectedly slow for some runs are marked
with “?” on their right side. The exact cause of the problem could not be
verified, but the conjecture is that the computational node was overloaded. It
should be noted that these runs are all of the split type. For instance, for the
run on ibm01 with N_nodes = 2, N_procs = 32, N_MPI = 2, and N_thrd = 16, two
MPI processes are requested on ibm01, one on the node ibm1-01sa and one on
ibm1-02sa. Only 8 processors are assigned to each of these nodes. However,
16 OpenMP threads are requested on each of these nodes; the additional 8
threads can only be provided by overloading the node. More detailed analysis
of the timings for the runs marked with “?” shows an order-of-magnitude
increase in computational time on these nodes, which supports the “overload”
conjecture.
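
The overload argument amounts to comparing the threads requested on a node with the processors that node has. A minimal sketch under the equal-split assumption from §2.1; the function names are hypothetical, and this is a check of the arithmetic only, not a diagnosis of the actual runs:

```python
def threads_on_node(n_mpi, n_thrd, n_nodes):
    """OpenMP threads requested on each node, assuming the MPI
    processes are split equally between the nodes."""
    if n_mpi % n_nodes != 0:
        raise ValueError("n_mpi must be divisible by n_nodes")
    return (n_mpi // n_nodes) * n_thrd


def is_overloaded(n_mpi, n_thrd, n_nodes, cpus_per_node):
    """True if a node is asked to run more threads than it has processors."""
    return threads_on_node(n_mpi, n_thrd, n_nodes) > cpus_per_node


# Run from the text: ibm01, N_nodes = 2, N_MPI = 2, N_thrd = 16, 8-way nodes
print(threads_on_node(2, 16, 2))
print(is_overloaded(2, 16, 2, cpus_per_node=8))

# For comparison: ibm01, N_nodes = 4, N_MPI = 8, N_thrd = 4 (Table 3)
print(is_overloaded(8, 4, 4, cpus_per_node=8))
```

The first case requests 16 threads on an 8-way node, while the second requests exactly 8 and fits.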

3 Conclusions and Future Work 

We have conducted a preliminary performance analysis of a practical CFD
application based on the MPI and hybrid (MPI+OpenMP) paradigms on single and
multiple IBM P4 nodes, using an 8 million grid point test case. Our applications
ran faster on the IBM P4 than in similar runs on the O2K. Due to the SMPs’ nodal
configuration and inter-nodal connection, the runtimes of the applications on
multiple IBM nodes are 7 to 20% longer than on a single node. The scalability
of the applications on the O2K appears somewhat better than on the IBM P4, but
this could not be quantified based on one test case and dataset.

Future work should focus on the scaling performance of several multi-level
hybrid applications based on multi-block structured overset grids, and also on
unstructured grid systems. Application test cases should be selected from among
various disciplines: CFD, climate modeling, and molecular dynamics.
Furthermore, larger datasets should be used and tested on a larger number of
IBM P4 SMPs.
Table 3: Part 2; Comparison of OVERFLOW-D runtimes (in seconds) on IBM
P4 and SGI O2K based on the hybrid programming model, using the 8 million
grid point test case, with a total of 32 to 64 processors

Hybrid Code

                                 IBM P4                            SGI O2K
Total N_procs   N_MPI   N_thrd   Machine       N_nodes   T_exec    T_exec
32              32      1        ibm02         1         2.8       5.9
                32      1        ibm01         4         3.0       -
                32      1        ibm02+ibm01   2         4.0       -
                16      2        ibm02         1         3.6       5.9
                16      2        ibm01         4         4.5       -
                16      2        ibm02+ibm01   2         5.2?      -
                8       4        ibm02         1         2.7       6.4
                8       4        ibm01         4         2.4       -
                8       4        ibm02+ibm01   2         7.3?      -
                4       8        ibm02         1         2.8       10.1
                4       8        ibm01         4         2.6       -
                4       8        ibm02+ibm01   2         15.5?     -
                2       16       ibm02         1         3.4       -
                2       16       ibm01         2         51.?      -
                2       16       ibm02+ibm01   2         51.?      -
64              32      2        -             -         -         3.8
                32      2        ibm02+ibm01   2         5.4?      -
                16      4        -             -         -         3.8
                16      4        ibm02+ibm01   2         12.3?     -
                8       8        -             -         -         4.0
                8       8        ibm02+ibm01   2         21.0?     -